20 research outputs found
FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e. a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.
digits, letters, words, etc.), and provide no control over granularity of
profiles. We define syntactic profiling as a problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns, that also allows for interactive refinement
by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real datasets, we observe a median profiling time of only s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e. using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill.Comment: 28 pages, SPLASH (OOPSLA) 201
Structure-Grounded Pretraining for Text-to-SQL
Learning to capture text-table alignment is essential for tasks like
text-to-SQL. A model needs to correctly recognize natural language references
to columns and values and to ground them in the given database schema. In this
paper, we present a novel weakly supervised Structure-Grounded pretraining
framework (StruG) for text-to-SQL that can effectively learn to capture
text-table alignment based on a parallel text-table corpus. We identify a set
of novel prediction tasks: column grounding, value grounding and column-value
mapping, and leverage them to pretrain a text-table encoder. Additionally, to
evaluate different methods under more realistic text-table alignment settings,
we create a new evaluation set Spider-Realistic based on Spider dev set with
explicit mentions of column names removed, and adopt eight existing text-to-SQL
datasets for cross-database evaluation. STRUG brings significant improvement
over BERT-LARGE in all settings. Compared with existing pretraining methods
such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms
all baselines on more realistic sets. All the code and data used in this work
is public available at https://aka.ms/strug.Comment: Accepted to NAACL 2021. Please contact the first author for questions
regarding the spider-realistic datase